Nvidia Volta - Architecture Highlights

Original, 2017-05-11, Tang Shan (唐杉), StarryHeavensAbove

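Nvidia CEO Jensen Huang's GTC keynote wrapped up in the early hours of this morning; the stock jumped and the media went wild. At the same time, devblogs.nvidia.com published an article, “Inside Volta: The World’s Most Advanced Data Center GPU”, which describes Nvidia's latest Volta architecture in some detail. It is well worth reading, and I recommend following the link to the original. Of course, you can also wait a bit: translations will probably start appearing tomorrow. I won't translate it here. Instead, I want to quickly share the points I consider most important at the architecture level, for your reference.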

1. Key Features

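As usual, let's first look at Volta's key features, which are presumably also what Nvidia is proudest of. I have trimmed the list somewhat. The first feature covers the largest number of hardware architecture changes, and I discuss it in detail later.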

New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning: Volta features a major new redesign of the SM processor architecture that is at the center of the GPU. ... With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations. ... Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 Data Cache and Shared Memory subsystem significantly improves performance while also simplifying programming.
Second-Generation NVLink™: ... supports up to 6 NVLink links at 25 GB/s for a total of 300 GB/s. ...
HBM2 Memory: Faster, Higher Efficiency: Volta’s highly tuned 16GB HBM2 memory subsystem delivers 900 GB/sec peak memory bandwidth. ...
Volta Multi-Process Service: Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. ...
Enhanced Unified Memory and Address Translation Services: GV100 Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. ...
Cooperative Groups and New Cooperative Launch APIs: ... allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions. ...
Maximum Performance and Maximum Efficiency Modes: ...
Volta Optimized Software: ... Volta-optimized versions of GPU accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications. ...
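
A quick sanity check on the NVLink numbers in the list: 6 links at 25 GB/s only gives 150 GB/s, so the quoted 300 GB/s total makes sense only if the 25 GB/s figure is per direction and the total counts both directions. That reading is my assumption, not something the excerpt states. A minimal sketch of the arithmetic:

```python
# Figures quoted in the feature list.
LINKS = 6
GB_PER_S_PER_LINK = 25

# Assumption (mine, not stated in the excerpt): 25 GB/s is per direction,
# and the quoted total adds up both directions of every link.
DIRECTIONS = 2

one_way_total = LINKS * GB_PER_S_PER_LINK        # aggregate in one direction
both_ways_total = one_way_total * DIRECTIONS     # aggregate in both directions

print("one direction:", one_way_total, "GB/s")
print("both directions:", both_ways_total, "GB/s")
```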

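Judging from these features, Nvidia is putting more and more emphasis on Deep Learning support, and this combination of hardware and software features gives it a real advantage in data center training. The figure below compares GV100 (Volta) with earlier architectures. The interesting points include the addition of Tensor Cores with their corresponding performance figures, the huge 815 mm² die area, and the advanced 12nm FFN process.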

Source: devblogs.nvidia.com

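Nvidia's big weapon uses a 12nm process, yet the die area still reaches 815 mm². Can the yield be good enough? On second thought, maybe Nvidia doesn't care: people will fight to buy the chip even at sky-high prices.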

2. Tensor Cores

Of course, the most eye-catching part of the Volta architecture is the new Tensor Cores. The figure below illustrates what they do.

Source: devblogs.nvidia.com

Judging from the formula and the diagram, one Tensor Core in Volta performs a multiply-accumulate on 4x4 matrices, D = A×B + C.
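
To make the operation concrete, here is a minimal sketch of a 4x4 matrix multiply-accumulate with FP16 inputs and FP32 accumulation, emulated in plain Python. The helper names (`to_fp16`, `to_fp32`, `tensor_core_mma`) are my own, and the order in which the real hardware rounds intermediate sums is not described in the Nvidia article, so treat this as an illustration of the arithmetic rather than a model of the circuit.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tensor_core_mma(a, b, c):
    """Emulate D = A*B + C on 4x4 matrices: FP16 inputs, FP32 accumulate.

    Assumption: the accumulator is rounded to FP32 after every addition.
    The real rounding behaviour is not documented in the source article.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exactly representable
                # in FP32, so only the addition needs rounding.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]
d = tensor_core_mma(identity, b, c)
print(d[3])
```

With A set to the identity, D is simply B + C, which makes it easy to eyeball the result. Inputs that are not exactly representable in FP16 (0.1, for example) are rounded before the multiply, which is where the precision trade-off of the FP16 inputs shows up.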

“Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 multiply and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation.”
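
The per-SM numbers in this quote can be checked with a few lines of arithmetic. The sketch below counts one FMA as two floating point operations. The Pascal side is my own addition: it assumes a GP100 SM has 64 FP32 cores, each doing one FMA per clock, which is not stated in the quote.

```python
# From the quote: FMAs per Tensor Core per clock, Tensor Cores per SM.
FMA_PER_TENSOR_CORE = 64
TENSOR_CORES_PER_SM = 8
FLOPS_PER_FMA = 2  # one multiply plus one add

# A 4x4 by 4x4 matrix product needs 4*4*4 scalar multiply-adds,
# which matches the 64 FMAs per clock in the quote.
fma_in_4x4_product = 4 * 4 * 4

volta_flops_per_sm = FMA_PER_TENSOR_CORE * FLOPS_PER_FMA * TENSOR_CORES_PER_SM

# Assumption (not in the quote): a Pascal GP100 SM has 64 FP32 cores,
# each performing one FMA per clock.
PASCAL_FP32_CORES_PER_SM = 64
pascal_flops_per_sm = PASCAL_FP32_CORES_PER_SM * FLOPS_PER_FMA

per_sm_speedup = volta_flops_per_sm // pascal_flops_per_sm

print(fma_in_4x4_product, volta_flops_per_sm, pascal_flops_per_sm, per_sm_speedup)
```

Under those assumptions the 1024 operations per clock and the 8X per-SM figure both fall out directly. The 12X whole-GPU figure also depends on SM counts and clock speeds, which I have not tried to reproduce here.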

As for the hardware architecture, the article gives a simplified diagram.

Source: devblogs.nvidia.com

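Based on our earlier analysis of the Google TPU (Systolic Arrays - Reborn Thanks to Google TPU), the Tensor Core differs from a PE (cell) of the TPU's systolic array in two main ways: 1. The Tensor Core is much coarser-grained, whereas each TPU cell performs only a single scalar multiply-accumulate. 2. The Tensor Core has higher precision, whereas the TPU cell supports only 8-bit and 16-bit fixed-point operations. After all, the TPU only targets data center inference. Here again is my speculative sketch of a TPU cell.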

From the point of view of parallel execution, multiple Tensor Cores can be used concurrently by one “warp”:

“During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores.”
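
One way to picture the relationship between the warp-level 16x16x16 operation and the 4x4 Tensor Core operation is as a tiling: the large product can be decomposed into 4x4x4 tile multiply-accumulates. The sketch below only illustrates that arithmetic decomposition. How the hardware actually distributes the work across Tensor Cores and threads is not described in the article, and the function names are hypothetical.

```python
TILE = 4

def mma_4x4(a, b, c):
    """One 4x4 multiply-accumulate: returns a*b + c."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)] for i in range(TILE)]

def tile(m, row, col):
    """Extract the 4x4 tile of m whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in m[row:row + TILE]]

def warp_mma(a, b, c, n=16):
    """Compute D = A*B + C for n x n matrices using only 4x4 tile operations.

    Returns the result and the number of 4x4 tile operations used.
    """
    d = [row[:] for row in c]
    tile_ops = 0
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = tile(d, i, j)
            for k in range(0, n, TILE):
                acc = mma_4x4(tile(a, i, k), tile(b, k, j), acc)
                tile_ops += 1
            for r in range(TILE):
                d[i + r][j:j + TILE] = acc[r]
    return d, tile_ops

# Small integer inputs keep the comparison with a direct product exact.
N = 16
A = [[(i + j) % 3 for j in range(N)] for i in range(N)]
B = [[(i * j) % 5 for j in range(N)] for i in range(N)]
C = [[1] * N for _ in range(N)]

D, ops = warp_mma(A, B, C)
reference = [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(N))
              for j in range(N)] for i in range(N)]
print(ops, D == reference)
```

In this decomposition a 16x16x16 operation takes (16/4)^3 = 64 tile operations, which gives a feel for how much work a single warp-level call bundles together.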

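Tensor Cores are indeed designed specifically for Deep Learning, but can such a coarse compute granularity match real applications well? I trust Nvidia has its reasons for this choice.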

3. Independent Thread Scheduling

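Another important architectural change in Volta is so-called independent thread scheduling. In fact, I think this is a bigger architectural move than adding Tensor Cores. Before Volta, the defining characteristic of Nvidia's SIMT (Single Instruction, Multiple Threads) architecture was that the 32 threads of a “warp” shared one PC (Program Counter) and one stack. In Volta, each thread has its own PC and stack, as shown in the figure below.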

Source: devblogs.nvidia.com

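If you are not familiar with SIMT, Nvidia's introductory CUDA material is a good reference. Put simply, Nvidia's SIMT runs many threads in parallel on one processor, but those threads all execute the same program code on different data.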

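Because multiple threads shared a PC and stack in pre-Volta architectures, thread scheduling was fairly coarse-grained. Take the case below. When a branch occurs, threads with threadIdx < 4 execute A and B; the others execute X and Y. In Pascal, the 32 threads of a “warp” share one PC, combined with an “active mask” that specifies which threads are active at any given time. This means that divergent branches leave some threads inactive. The different portions of the “warp” execute serially until they reconverge, at which point the mask is restored and all threads run together again.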
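
The shared-PC behaviour can be sketched as a tiny simulation. The code below is a toy model of my own, not Nvidia's implementation: it walks one warp through the branch described above and records, for each statement, the active mask as a 32-bit integer. The statement `Z` stands for whatever code follows the branch and is added here only to show the reconvergence point.

```python
WARP_SIZE = 32

def pascal_style_trace(warp_size=WARP_SIZE, split=4):
    """Toy model of a warp with one shared PC and an active mask.

    Threads with index < split take the 'if' path (A, B); the rest take
    the 'else' path (X, Y). The two paths run one after the other, then
    the full mask is restored for the hypothetical statement Z.
    """
    full_mask = (1 << warp_size) - 1
    if_mask = (1 << split) - 1            # bits 0..split-1 set
    else_mask = full_mask & ~if_mask      # the remaining threads

    trace = []
    for stmt in ("A", "B"):               # 'if' path, 'else' threads idle
        trace.append((stmt, if_mask))
    for stmt in ("X", "Y"):               # 'else' path, 'if' threads idle
        trace.append((stmt, else_mask))
    trace.append(("Z", full_mask))        # reconverged, all threads active
    return trace

pascal_trace = pascal_style_trace()
for stmt, mask in pascal_trace:
    print(stmt, format(mask, "032b"), bin(mask).count("1"), "active")
```

The trace makes the cost visible: while A and B run, 28 of the 32 lanes sit idle, and while X and Y run, the other 4 do.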

Source: devblogs.nvidia.com

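In Volta, things change. As shown in the figure below, different scheduling becomes possible: statements from the if and else branches can now be interleaved in time. Execution is still SIMT, though: at any given clock cycle, all active threads in a warp execute the same instruction, which preserves the execution efficiency of earlier architectures.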
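
Here is a toy sketch of what per-thread program counters make possible. Each thread keeps its own PC, and at every step a scheduler picks one pending statement and issues it to all threads currently waiting on that statement, so each step is still SIMT. The round-robin policy is purely my assumption for illustration; the article does not describe how the real scheduler chooses.

```python
def volta_style_trace(warp_size=32, split=4):
    """Toy model of a warp where every thread has its own PC.

    Returns a list of (statement, number_of_active_threads) per step.
    Statement names are assumed to be unique across the two paths.
    """
    programs = [["A", "B"] if t < split else ["X", "Y"]
                for t in range(warp_size)]
    pc = [0] * warp_size
    trace = []
    step = 0
    while True:
        # Collect the distinct statements that some thread is waiting on.
        pending = []
        for t in range(warp_size):
            if pc[t] < len(programs[t]):
                stmt = programs[t][pc[t]]
                if stmt not in pending:
                    pending.append(stmt)
        if not pending:
            break
        # Hypothetical policy: rotate between the pending statements.
        stmt = pending[step % len(pending)]
        # SIMT step: every thread waiting on this statement executes it.
        active = [t for t in range(warp_size)
                  if pc[t] < len(programs[t]) and programs[t][pc[t]] == stmt]
        for t in active:
            pc[t] += 1
        trace.append((stmt, len(active)))
        step += 1
    return trace

volta_trace = volta_style_trace()
print(volta_trace)
```

With this policy the two paths alternate instead of running back to back. Note that the total number of issue steps for the branch bodies is the same as in the shared-PC case; what changes is the ordering freedom, which is what allows finer-grained synchronization between threads on different paths.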

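This shows that the point of independent thread scheduling is finer scheduling granularity, at the cost of considerable extra hardware (every thread needs its own PC and stack).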

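Giving every thread its own PC and stack is not cheap. Whether Independent Thread Scheduling delivers its full benefit depends critically on the software tools. Nvidia has invested heavily in software tools over the years, so it is presumably confident.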

4. Other Improvements

Besides the two improvements above, which I consider the most important, the Volta architecture has several others worth noting.

First, FP32 and INT32 instructions can execute simultaneously:

“Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal.”
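
A back-of-the-envelope model helps show why these two changes matter. The sketch below is deliberately idealized and entirely my own: it ignores memory, scheduling limits, and everything else, and only compares issue slots with and without separate FP32/INT32 pipes, plus the cycle count of a chain of dependent FMAs at the two latencies quoted above.

```python
def issue_slots(fp32_ops, int32_ops, separate_pipes):
    """Idealized issue slots needed for a mix of FP32 and INT32 instructions.

    Shared pipe: the two kinds of instructions take turns.
    Separate pipes: they overlap perfectly, so the longer stream dominates.
    """
    if separate_pipes:
        return max(fp32_ops, int32_ops)
    return fp32_ops + int32_ops

def dependent_chain_cycles(n_fma, latency_cycles):
    """Cycles for n_fma FMAs where each one depends on the previous result."""
    return n_fma * latency_cycles

# Hypothetical loop body: 100 FP32 math ops and 40 INT32 address calculations.
shared = issue_slots(100, 40, separate_pipes=False)
separate = issue_slots(100, 40, separate_pipes=True)

# Latencies from the quote: six cycles on Pascal, four on Volta.
pascal_chain = dependent_chain_cycles(1000, 6)
volta_chain = dependent_chain_cycles(1000, 4)

print(shared, separate, pascal_chain, volta_chain)
```

In this toy model the address arithmetic becomes free once it overlaps with the floating point work, and a fully dependent FMA chain finishes in two thirds of the cycles. Real kernels will land somewhere short of these ideal numbers.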

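Second, the enhanced L1 Data Cache and Shared Memory: they allow more flexible use of cache and shared memory, and make it possible to fully exploit the performance advantage of shared memory (which, of course, the programmer has to manage).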

Third, the “STARVATION-FREE ALGORITHMS” described in the article. I don't have a strong feel for this one; if you are interested, see the original.

That's all for now. Comments and discussion are welcome.

T.S.

Cover image from the internet; copyright belongs to the original author.